Dataset: a tidy data set from 2009 testing the chemical properties of Vinho Verde red wine. At least 3 wine experts rated the quality of each wine between 0 (very bad) and 10 (very excellent). More information can be found here..

Question: What variables contribute to making the best quality Red Wine?

1. Initial Analysis: structure of the dataset.

Dimensions?

## [1] 1599   13

Variables?

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

Field Descriptions:

  1. Fixed acidity: most wine acids involved are fixed or nonvolatile (do not evaporate readily)
  2. Volatile acidity: amount of acetic acid in wine - can be unpleasant, vinegary taste if too high?
  3. Citric acid: found in small quantities, can add ‘freshness’ and flavor to wines
  4. Residual sugar: sugar remaining after fermentation stops, rare < 1 gram/liter, > 45 grams/liter are considered sweet
  5. Chlorides: amount of salt in the wine
  6. Free sulfur dioxide: the free form of SO2 - prevents microbial growth and the oxidation of wine
  7. Total sulfur dioxide: free + bound forms of S02; in low concentrations, mostly undetectable in wine, free SO2 over 50 ppm, evident in the nose and taste of wine
  8. Density: the density of water is close to that of water (approx 1) depending on the percent alcohol and sugar content
  9. pH: acidic on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
  10. Sulphates: anadditive which can contribute to S02 levels, acts as an antimicrobial and antioxidant.
    11.Alcohol: the percent alcohol content of the wine
  11. Quality (Output Variable) - sensory score between 0 and 10

Structure?

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Descriptive Stats?

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Thoughts at this stage - Quality range is between 3 and 8 - does this correlate with anything?

Mean/Median seem to be relatively close on all variables except total.sulfur.dioxide and chlorides - long tailed?

2. Data Transformation (Univariate analysis)

Next step to look at the distribution of each individual variable and how a log10 or sqrt transformation might affect this.

  1. Fixed acidity: Remove outliers

Transformations have an effect but once 8 outliers are removed (> 14) the distribution is quite normal. I also checked the outliers in case there were any significant results. As the quality is all between 5 and 7 where the majority of the results fall this is fine.

Quality Table: acidity > 14

##    
##     FALSE TRUE
##   3    10    0
##   4    53    0
##   5   677    4
##   6   637    1
##   7   196    3
##   8    18    0

  1. Volatile acidity: removed outliers

As with fixed acidity. The best option seemed to be to remove 4 outliers (>1.2)

Quality Table: volatile acidity > 1.2

##    
##     FALSE TRUE
##   3     9    1
##   4    53    0
##   5   678    3
##   6   638    0
##   7   199    0
##   8    18    0

  1. Citric acid

There are 132 0 values which seem to be pretty evenly distributed on quality so neither of the transformations are suitable. The best I can do here is remove 6 outliers which spreads the distribution out a bit.

Quality Table: citric acid == 0

##    
##     FALSE TRUE
##   3     7    3
##   4    43   10
##   5   624   57
##   6   584   54
##   7   191    8
##   8    18    0

Quality Table: citric acid > 0.75

##    
##     FALSE TRUE
##   3    10    0
##   4    52    1
##   5   679    2
##   6   637    1
##   7   197    2
##   8    18    0

  1. Residual sugar This time I went for a log transformation. Removing outliers >6 (see graph) did had a significant impact on the distribution but there were 46 records which I felt was too many.

Quality Table: residual sugar > 6

##    
##     FALSE TRUE
##   3    10    0
##   4    51    2
##   5   658   23
##   6   624   14
##   7   191    8
##   8    17    1

Outliers over 6 looks evenly spread.

  1. Chlorides: For chlorides I went for a Log 10 transformation as it produced a central relatively evenly spread curve.

Quality Table: chlorides > 0.25

##    
##     FALSE TRUE
##   3     9    1
##   4    52    1
##   5   665   16
##   6   632    6
##   7   198    1
##   8    18    0

  1. Free sulfur dioxide
    Sqrt transformation - this normalises better than log 10. Removing 4 outliers doesn’t alter the distribution significantly.

Quality Table: free sulfur dioxide > 60

##    
##     FALSE TRUE
##   3    10    0
##   4    53    0
##   5   678    3
##   6   637    1
##   7   199    0
##   8    18    0

  1. Total sulfur dioxide

For total SO2 I chose to remove 3 outliers (>160). The distribution is still heavily positively skewed but neither transformation option seems to have a significant effect. The 3 outliers were quite high in quality so I double checked the mean for all data > 150 in case this was a significant trend.

Quality Table: total sulfur dioxide > 160

##    
##     FALSE TRUE
##   3    10    0
##   4    53    0
##   5   681    0
##   6   637    1
##   7   197    2
##   8    18    0

Double checking means (new then old).

## [1] 5.666667
## [1] 5.636023

  1. Density: No change - normal already

  2. pH: no change - normal already

  1. Sulphates: I decided on log10 transformation. The outliers have low quality values (all under 6) which may be signficant.

Quality Table: sulphates > 1.5

##    
##     FALSE TRUE
##   3    10    0
##   4    52    1
##   5   677    4
##   6   635    3
##   7   199    0
##   8    18    0
  1. Alcohol:

No change - transformations have no significant impact, neither does removing the outliers.

Quality Table: alcohol > 14

##    
##     FALSE TRUE
##   3    10    0
##   4    53    0
##   5   680    1
##   6   638    0
##   7   199    0
##   8    18    0

  1. Quality (Output Variable) - sensory score between 0 and 10. Not sensible to change it.

The combined outlier removal resulted in 20 less rows:

## [1] 1579   13

3. GG Pairs (Bivariate Analysis)

Below is the ggpairs for the transformed data. I’ve colour coded the Pearson’s R results as follows:

yellow (0.3-0.5 small meaning) orange (0.5-0.7 moderate) red (0.7+ pretty large)

The corresponding plots are highighted in green.

What does this mean though?

I ended up plotting it on a flow diagram using draw.io. Colour coding applies to the arrows.

Correlations map:

Conclusions:

  1. Although Free and Total SO2 are highly correlated, they do not correlate with any other variables so I am not going to analyse them further.

  2. Residual sugar seems to have a small correlation to density but nothing else. This may be due to the fact that vinho verde red wine isn’t intended to be sweet. I won’t analyse that further either.

  3. Start off with 3 critical factors - Alcohol, Volatile Acid and Sulphates.

  4. Link these critical factors to related variables and see if I can see any further patterns.

4. Quick look at Quality

There majority of the data has quality level 5,6,7.

5. Critical Factor Analysis (Multivariate Analysis)

1. ALCOHOL

Some stats first:

## Source: local data frame [6 x 4]
## 
##   quality alcohol_mean alcohol_median     n
##     (int)        (dbl)          (dbl) (int)
## 1       3     9.850000           9.90     9
## 2       4    10.281731          10.00    52
## 3       5     9.880655           9.70   672
## 4       6    10.631234          10.50   635
## 5       7    11.474180          11.50   193
## 6       8    12.094444          12.15    18

Percent with quality 5,6,7 = 0.9493433

Initial Analysis:

95% of the data is in the quality bracket 5-7 The alcohol content goes up in line with the quality. Nothing <10 for top quality, nothing over 12 for lower The majority of mid range wines (5 & 6) = 10

In more detail..

Once the quality gets to 5 there is a clear positive trend.

What about density?, fixed acidity, chlorides, citric acid and pH?
How do they affect the alcohol level.
What makes it high?

I started with density. The density of the wine is known to be linked to the alcohol content so this correlation is not a surprise.

There seems to be a negative linear correlation between alcohol and density - the content is higher when the density is lower.

How about fixed acidity and density?

There seems to be a positive linear correlation here. The higher the density, the higher the fixed acidity.

How about chlorides and density?

Slight positive trend here but high distribution of quality. It seems more of a cluster that a trend and the quality is dispersed - dead end?

How about citric acid and density?

The smoother is confusing here. There is no clear linear regression. However, the correlation between citric acid and quality looks quite strong.

Citric acid is “found in small quantities and can add freshness and flavour to wines”. This could clearly have an impact on quality despite the fact that the quality ratings for 0 values are well distributed.

To double check, lets see what happens when we remove the 0 values.

Pearsons’R on quality is

## [1] 0.2349862

This is not wildly significant. Adding this to the fact that the data we removed did not show a trend of low quality data I’d say this is a dead end as well.

And lastly pH?

Looks like a strong negative trend here…

But does pH figure? Tricky to tell. The Ph levels do seem higher at the ‘positive’ end?

Possible hypothesis:

low density levels + low fixed.acidity + higher ph levels = higher alcohol content = better quality wine?

2. VOLATILE ACIDITY

Some stats first:

## Source: local data frame [6 x 4]
## 
##   quality vol.acid_mean vol.acid_median     n
##     (int)         (dbl)           (dbl) (int)
## 1       3     0.8072222          0.8150     9
## 2       4     0.6973077          0.6725    52
## 3       5     0.5741518          0.5800   672
## 4       6     0.4985276          0.5000   635
## 5       7     0.4057254          0.3700   193
## 6       8     0.4233333          0.3700    18

It looks like a gradual descent in acidity from low to high quality. The mean and median quite similar.

Not much more to glean from the histograms really. Distributions seem pretty similar on first glance.

This plot does reflect the downward trend but there is a lot of noise. The smoother may be misleading?
Let’s have a look at mean and quantiles. Also let’s check if it’s worth adding log10 or sqrt transformations.

No to transformation. Slight downward trend still visible.

Where’s the correlation between Volatile and Fixed Acidity?

Seems to be a bit of a cluster - the majority of results have low fixed and volatile acidity. It looks like fixed acidity and volatile acidity are related and pretty constant - as you’d expect.

There seems to be no particular correlation with quality so perhaps a dead end?

Possible hypothesis - Volatile Acidity may have a small impact on quality - the lower the acidity, the better the wine. This may or may not be significant.

3. SULPHATES

Some basic analysis first:

## Source: local data frame [6 x 4]
## 
##   quality sulph_mean sulph_median     n
##     (int)      (dbl)        (dbl) (int)
## 1       3  0.5888889         0.55     9
## 2       4  0.5694231         0.56    52
## 3       5  0.6194792         0.58   672
## 4       6  0.6755906         0.64   635
## 5       7  0.7436788         0.74   193
## 6       8  0.7677778         0.74    18

Looks like a pretty clear positive trend in both mean and median.

Lets look at a facet grid - I’ve transformed sulphates by log10 as that was the decision I made earlier.

Histograms back this up..

What about mean and quantiles?

Slight positive trend still visible..

What about citric acid?

The distribution is pretty constant. There seem to be more low quality wines for lower sulphate levels (as we discovered earlier) but the addition of citric acid levels does not impact that in any way.

or chlorides?

This backs up the theory that higher sulphate levels may = better quality wine.

There is a clear constant and logical cluster between chlorides and sulphates so chlorides are unlikely to have a follow through impact on quality.

Possible hypothesis - Sulphates may have a small impact on quality - the higher the sulphate level, the better the wine.

6. Final Three Plots

Three possible hypotheses so far:

  1. low density levels + low fixed.acidity + higher ph levels (over 3?) = higher alcohol content = better quality wine
  2. higher sulphate levels = better the wine
  3. lower volatile acidity = better the wine

Plot One: Alcohol, Quality and related values - following the trail

Explanation:

Possible Conclusion:

low density + low fixed acidity + high ph = high alcohol = better wine?

Here’s the plot to back me up…

A bit more statistical analysis:

## 
## Calls:
## m4: lm(formula = I(alcohol) ~ I(quality) + density + fixed.acidity + 
##     pH, data = rwine)
## 
## ===============================
##   (Intercept)      440.405***  
##                    (13.269)    
##   I(quality)         0.371***  
##                     (0.023)    
##   density         -448.051***  
##                    (13.486)    
##   fixed.acidity      0.452***  
##                     (0.019)    
##   pH                 3.247***  
##                     (0.159)    
## -------------------------------
##   R-squared             0.6    
##   adj. R-squared        0.6    
##   sigma                 0.7    
##   F                   564.3    
##   p                     0.0    
##   Log-likelihood    -1623.6    
##   Deviance            722.9    
##   AIC                3259.3    
##   BIC                3291.5    
##   N                  1579      
## ===============================

Conclusion:

The graphs indicate a clear link between alcohol content, density and wine quality. Factoring in a lower fixed acidity level and a higher pH level makes the regression stronger.

The stats are pretty good too. R-squared implies that 60% of the variance is explained by our variables. The F-score, Deviance, AIC and BIC are all very high though. Is this the result of linking 5 such diverse variables?

Plot Two: Sulphate and Quality

Explanation:

There is a visible positive trend between quality and sulphates.

Possible Conclusion:

higher sulphate levels = better quality wine

## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.

A bit more statistical analysis:

## 
## Calls:
## m1: lm(formula = I(log10(sulphates)) ~ I(quality), data = rwine)
## 
## =============================
##   (Intercept)     -0.411***  
##                   (0.016)    
##   I(quality)       0.038***  
##                   (0.003)    
## -----------------------------
##   R-squared            0.1   
##   adj. R-squared       0.1   
##   sigma                0.1   
##   F                  182.1   
##   p                    0.0   
##   Log-likelihood    1546.1   
##   Deviance            13.0   
##   AIC              -3086.3   
##   BIC              -3070.2   
##   N                 1579     
## =============================

Conclusion: The plot shows a clear positive correlation between sulphates and wine quality but the stats tell a different story.

R-squared only explains 10% of the variance which does not indicate high significance. The F-score is high, possibly too high, as are the AIC, BIC and Deviance. All in all I’d conclude that the results are unlikely to be significant but I’d keep observing sulphate leves when a new dataset comes in.

Plot Three: Volatile Acidity and Quality

Explanation:

There is a visible negative trend between quality and volatile acidity.

PossibleConclusion:

lower volatile acidity levels = better quality wine

A bit more statistical analysis:

## 
## Calls:
## m1: lm(formula = I(volatile.acidity) ~ I(quality), data = rwine)
## 
## =============================
##   (Intercept)      0.993***  
##                   (0.029)    
##   I(quality)      -0.083***  
##                   (0.005)    
## -----------------------------
##   R-squared            0.1   
##   adj. R-squared       0.1   
##   sigma                0.2   
##   F                  269.9   
##   p                    0.0   
##   Log-likelihood     647.2   
##   Deviance            40.7   
##   AIC              -1288.4   
##   BIC              -1272.3   
##   N                 1579     
## =============================

Conclusion: The plots shows a visible link between Volatile Acidity and Quality but as with Sulphates the statistics don’t back this up very well.

R-squared only explains 10% of the variance which does not indicate high significance. The F-score is quite high as are the AIC, BIC and Deviance. Again, I’d conclude that the results are unlikely to be significant but I’d keep observing volatile acidity levels when a new dataset comes in.

7. Overall Conclusion:

The results indicate that there is a strong likelihood that the following factors contribute to high quality wine:

They also indicate a possible linke between these variables, but for this dataset it does not look significant.

The ideal follow-up would be a matching analysis on subsequent years. The dataset was recorded in 2009. If similar data has been gathered over the past six years that would allow us to come to more reliable, robust conclusions.

It is also worth bearing in mind that the ‘quality’ is a sensory score given by 3 professionals. If the same professionals were involved each year that would make the analysis more robust.

8. Reflection

This project has been a learning curve in action. It has been a while since I took the Data Analysis with R module so I’ve had to refamilarise myself with RStudio, ggplot and a lot more. After the 1st submission I had to rewrite my R file in RMD as I forgot about this and was working on an RScript!

It’s been interesting as back in January I had a go analysing the white wine data using Excel and Python. This was arduous and inconclusive so it’s been great to really dig into the red wine data and come to some logical conclusions.

I started with a GGPair and quickly realised that I would have to transform the data first as the distribution was only normal on a couple of variables. My first mistake was to try and hurry the process and plot standard, log10 and sqrt histograms for all the univariates on one chart. While this saved spaced, I missed various nuances.

After the 1st submission I looked again in more detail and ended up removing outliers for three of the variables instead. This was time consuming but very valuable.

The other thing I had difficulty with was interpreting the GGPair and therefore deciding where to go next. I found the flow diagram really useful both for deciding the critical factors and the variables that might impact them.

For the critical factors, it was the challenge of getting to know ggplot again and working out the best geom and scale to convey meaning. I found the percentiles and mean very useful. The quality scale took a while but I got there in the end.

I also failed to look at statistics first time round. The lm stats were really helpful either in supporting or making me sceptical about the plot indications.

Overall, this has been a positive experience though as I now feel confident enough to approach a new R analysis with new data.

A possible next step could be to apply machine learning: Can I predict top quality wines based on the links I’ve discovered? What does KMeansBest say about my features? Perhaps I could create some new composite features? PCA of alcohol/density springs to mind?

Or I could repeat the process with white wine data and look for similarities and differences?